ENRON Emails - Exploratory Data Analysis

1. READING DATA and Creating Shallow & Deep Features for analysis

1.1 ) Creating Shallow Features from dataframe

1.2 ) Creating Deep Features from dataframe

2. Data Analysis

2.1 ) Visualizing count of emails over years

2.2 ) Simple Sentiment Analysis (with Polarity and Subjectivity)

For a preliminary view, TextBlob is used to compute "Polarity" and "Subjectivity" across emails. The charts below ('Sentiment Polarity over years' and 'Sentiment Subjectivity over years') show that sentiment started fluctuating in the year 2000.
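The yearly aggregation behind such charts can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: it assumes each email has already been scored (for example with `TextBlob(text).sentiment`, which returns a polarity in [-1, 1] and a subjectivity in [0, 1]). The `scored` records and the function name `yearly_sentiment` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def yearly_sentiment(scored):
    """Average polarity and subjectivity per year.

    `scored` is an iterable of (year, polarity, subjectivity) tuples.
    Returns {year: (mean_polarity, mean_subjectivity)} ordered by year.
    """
    by_year = defaultdict(list)
    for year, polarity, subjectivity in scored:
        by_year[year].append((polarity, subjectivity))
    return {
        year: (mean(p for p, _ in vals), mean(s for _, s in vals))
        for year, vals in sorted(by_year.items())
    }


# Made-up scores for illustration only (not taken from the ENRON data).
scored = [
    (1999, 0.10, 0.40),
    (1999, 0.20, 0.50),
    (2000, -0.30, 0.20),
    (2000, 0.50, 0.60),
]
result = yearly_sentiment(scored)
```

Plotting the two values per year as separate line charts would give a polarity trend and a subjectivity trend.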

The following section looks more closely at the years 2000 and 2001.

2.3 ) Sentiment Analysis between 2000 and 2001

In 2000 and 2001, sentiment polarity and subjectivity fluctuated around 0.1 and 0.3 respectively, with both positive and negative spikes.

2.4 ) Sender Domain Sentiment Rate

Because the sender initiates an email chain, it is useful to plot the sentiment rate by sender domain. The chart shows that senders from all domains fall between 0 and 0.25 in polarity and between 0.3 and 0.5 in subjectivity.
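One possible way to group sentiment by sender domain is sketched below. The helper names `sender_domain` and `domain_sentiment`, the sample addresses, and the scores are all hypothetical, and the sketch assumes polarity and subjectivity were computed per email beforehand.

```python
from collections import defaultdict
from statistics import mean


def sender_domain(address):
    """Return the lower-cased domain of an email address, or '' if there is none."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep else ""


def domain_sentiment(rows):
    """Mean (polarity, subjectivity) per sender domain.

    `rows` is an iterable of (sender_address, polarity, subjectivity).
    """
    grouped = defaultdict(list)
    for sender, polarity, subjectivity in rows:
        grouped[sender_domain(sender)].append((polarity, subjectivity))
    return {
        domain: (mean(p for p, _ in vals), mean(s for _, s in vals))
        for domain, vals in grouped.items()
    }


# Made-up rows for illustration only.
rows = [
    ("alice@enron.com", 0.00, 0.30),
    ("Bob@Enron.com", 0.10, 0.40),
    ("carol@hotmail.com", 0.20, 0.50),
]
result = domain_sentiment(rows)
```

Filtering the result to a single key such as `"enron.com"` would isolate one domain's senders for a separate comparison.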

2.5 ) ENRON sender sentiment rate

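The charts below show that ENRON senders have much lower polarity and subjectivity scores than other senders. The low polarity points to a less positive tone in the environment. Note that TextBlob subjectivity runs from 0 (objective) to 1 (subjective), so a low subjectivity score indicates fact-oriented rather than opinion-dominated text.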

2.6 ) Top 5 Email Sender

The chart shows a single dominant sender. The data is therefore too imbalanced for sender-level analysis, since it mostly reflects one person's email pattern.

2.7 ) Top 5 Email Receiver

The receiver data is balanced, but its volume is very low.

2.8 ) Top Sender-Receiver Pair

Because the sender and receiver data is low in volume or skewed, the chart below is plotted to identify any dominant sender-receiver pair.
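Counting pairs can be sketched with `collections.Counter`. The sketch assumes each email record carries a sender and a comma-separated receiver string; that layout, the sample addresses, and the name `top_pairs` are assumptions for illustration.

```python
from collections import Counter


def top_pairs(emails, n=5):
    """Return the n most frequent (sender, receiver) pairs.

    `emails` is an iterable of (sender, receivers), where `receivers`
    is a comma-separated string. An email sent to k receivers
    contributes k pairs.
    """
    counts = Counter()
    for sender, receivers in emails:
        sender = sender.strip().lower()
        for receiver in receivers.split(","):
            receiver = receiver.strip().lower()
            if receiver:  # skip empty fragments such as a trailing comma
                counts[(sender, receiver)] += 1
    return counts.most_common(n)


# Made-up records for illustration only.
emails = [
    ("a@enron.com", "b@enron.com, c@hotmail.com"),
    ("a@enron.com", "b@enron.com"),
    ("d@enron.com", "c@hotmail.com,"),
]
top = top_pairs(emails, n=1)
```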

2.9 ) Top Sender-Receiver DOMAIN Pair

This chart shows email exchange between domains (internal vs. external emails).
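The internal vs. external split could be computed along these lines. Treating exactly `enron.com` as the internal domain is an assumption (subdomains are not handled), and the helper names and sample pairs are hypothetical.

```python
from collections import Counter

INTERNAL_DOMAIN = "enron.com"  # assumption: a single internal domain


def domain_of(address):
    """Lower-cased domain part of an address, or '' if there is no '@'."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep else ""


def domain_pair_counts(pairs):
    """Count (sender_domain, receiver_domain) over (sender, receiver) pairs."""
    return Counter((domain_of(s), domain_of(r)) for s, r in pairs)


def internal_to_external(counts):
    """Keep only pairs sent from the internal domain to any other domain."""
    return {
        pair: n
        for pair, n in counts.items()
        if pair[0] == INTERNAL_DOMAIN and pair[1] != INTERNAL_DOMAIN
    }


# Made-up pairs for illustration only.
pairs = [
    ("a@enron.com", "b@enron.com"),
    ("a@enron.com", "c@hotmail.com"),
    ("d@enron.com", "e@hotmail.com"),
    ("f@rr.com", "a@enron.com"),
]
counts = domain_pair_counts(pairs)
outbound = internal_to_external(counts)
```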

2.10 ) Top Enron (Sender) - Receiver DOMAIN Pair

The chart above gives an intuition of how many emails were exchanged between Enron senders and external domains. External domains such as 'hotmail.com', 'austinits.com' and 'rr.com' appear in the top category. As a second iteration, it would be worthwhile to look closely at the emails sent to those specific domains to understand "sensitive information exchanges".

2.11 ) WordCloud visualization with top 50 words to understand the words used in the 4 corpora below

1) 'Overall Email corpus', 2) 'New Email', 3) 'Reply Email' and 4) 'Forward Email'
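The word frequencies behind the four word clouds might be built as in the sketch below. Deriving the email type from the subject prefix ("Re:", "Fw:", "Fwd:") is an assumption, as are the tiny stop-word list, the helper names, and the sample emails.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "to", "and", "of", "is", "for", "in", "on"}


def email_type(subject):
    """Classify an email as 'reply', 'forward' or 'new' from its subject prefix."""
    s = subject.strip().lower()
    if s.startswith("re:"):
        return "reply"
    if s.startswith(("fw:", "fwd:")):
        return "forward"
    return "new"


def top_words(emails, n=50):
    """Top-n word counts for the overall corpus and for each email type.

    `emails` is an iterable of (subject, body) tuples.
    """
    counts = defaultdict(Counter)
    for subject, body in emails:
        words = [w for w in re.findall(r"[a-z']+", body.lower()) if w not in STOPWORDS]
        counts["overall"].update(words)
        counts[email_type(subject)].update(words)
    return {name: counter.most_common(n) for name, counter in counts.items()}


# Made-up emails for illustration only.
emails = [
    ("Meeting", "The gas contract is ready"),
    ("RE: Meeting", "Gas contract approved"),
    ("FW: Prices", "Gas prices for review"),
]
result = top_words(emails, n=3)
```

Each list of (word, count) pairs can be turned into a dict and passed to a word cloud renderer, for example `WordCloud.generate_from_frequencies` from the `wordcloud` package.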

3.0 ) Topic Modeling

LDA is used to create topics. It relies on two objects built during Deep Feature creation: DTM, the Document Term Matrix, and CV, the Count Vectorizer.


4.0 ) Conclusion & Next Step

2) While cleaning the data, a lot of PII (such as phone numbers, IP addresses, passwords, etc.) came up, so identifying emails that contain PII would be a good use case. A machine learning model could be developed to identify such emails and increase the protection of PII data.

3) The email corpus, and specifically the Forward Email corpus, can be further explored to understand sensitive information exchange with internal and external email addresses. With this data, a model can be developed to flag such external email communication.

4) Topic modeling can be further enhanced to categorize each email. K-means can be leveraged to understand clusters across the email corpus.

5) Overall, this email data with the added deep features can be leveraged to
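As a rough starting point for the PII idea in point 2, a rule-based baseline could flag candidate emails before any model is trained. This is a regex sketch, not a machine learning model, and the patterns are simplistic assumptions (US-style phone numbers, dotted IPv4-like strings, "password:" phrases). The name `find_pii` and the sample text are hypothetical.

```python
import re

# Simplistic illustrative patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "password": re.compile(r"\bpassword\s*[:=]\s*\S+", re.IGNORECASE),
}


def find_pii(text):
    """Return {pii_type: [matches]} for every pattern that matches the text."""
    return {
        name: pattern.findall(text)
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }


# Made-up email body for illustration only.
sample = "Call me at 713-555-0100. Server is 10.0.0.12, password: hunter2"
hits = find_pii(sample)
clean = find_pii("See you at lunch")
```

Emails flagged this way could serve as weak labels for training the kind of model the conclusion proposes.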